A cascaded approach to normalising gene mentions in biomedical literature

نویسندگان

  • Hui Yang
  • Goran Nenadic
  • John A Keane
چکیده

Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text are first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%.

منابع مشابه

Species taxonomy for gene name normalization

Background: The task of gene normalization is to assign a unique identifier from a database to the gene mentions. Using these identifiers a great deal of information can be gathered from external databases such as interactions, pathways, sequences and protein structures. Normalizing gene mentions in articles is a difficult task as the inter-species ambiguity of the gene mentions in biomedical p...

متن کامل

Species taxonomy for gene normalization

Background: The task of gene normalization is to assign a unique identifier from a database to the gene mentions. Using these identifiers a great deal of information can be gathered from external databases such as interactions, pathways, sequences and protein structures. Normalizing gene mentions in articles is a difficult task as the inter-species ambiguity of the gene mentions in biomedical p...

متن کامل

Inter-Event Dependencies support Event Extraction from Biomedical Literature

The description of events in biomedical literature often follows discourse patterns. For example, authors may firstly mention the transcription of a gene, and then go on to describe how this transcription is regulated by another gene. Capturing such patterns can be beneficial when we want to extract event mentions from literature. For instance, detecting the mention of a transcription of gene A...

متن کامل

Assigning roles to protein mentions: The case of transcription factors

Transcription factors (TFs) play a crucial role in gene regulation, and providing structured and curated information about them is important for genome biology. Manual curation of TF related data is time-consuming and always lags behind the actual knowledge available in the biomedical literature. Here we present a machine-learning text mining approach for identification and tagging of protein m...

متن کامل

Comparing Usability of Matching Techniques for Normalising Biomedical Named Entities

String matching plays an important role in biomedical Term Normalisation, the task of linking mentions of biomedical entities to identifiers in reference databases. This paper evaluates exact, rule-based and various string-similarity-based matching techniques. The matchers are compared in two ways: first, we measure precision and recall against a gold-standard dataset and second, we integrate t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

متن کامل
عنوان ژورنال:
  • Bioinformation

دوره 2  شماره 

صفحات  -

تاریخ انتشار 2007